-
Notifications
You must be signed in to change notification settings - Fork 187
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
incremental scd2
with merge_key
#1818
Conversation
✅ Deploy Preview for dlt-hub-docs ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
assert_load_info(info) | ||
assert load_table_counts(p, "dim_test")["dim_test"] == 3 | ||
boundary_ts = get_load_package_created_at(p, info) | ||
# natural key 1 should now have two records (one retired, one active) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should also test here that when the row with natural key 1 reverts back to the first version (and has the same row hash as it had in the first load) it will still create a new entry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
dlt/common/schema/typing.py
Outdated
validity_column_names: Optional[List[str]] | ||
active_record_timestamp: Optional[TAnyDateTime] | ||
boundary_timestamp: Optional[TAnyDateTime] | ||
row_version_column_name: Optional[str] | ||
retire_if_absent: Optional[bool] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe call this "retire_absent_rows"? then it is a bit more clear just from the naming what it does. wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! Changed it.
scd2
with retire_if_absent
flagscd2
with retire_absent_rows
flag
dlt/common/schema/typing.py
Outdated
validity_column_names: Optional[List[str]] | ||
active_record_timestamp: Optional[TAnyDateTime] | ||
boundary_timestamp: Optional[TAnyDateTime] | ||
row_version_column_name: Optional[str] | ||
retire_absent_rows: Optional[bool] | ||
natural_key: Optional[str] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is no longer needed, is it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops. Removed now.
dlt/extract/hints.py
Outdated
dict_["x-boundary-timestamp"] = md_dict["boundary_timestamp"] | ||
if "retire_absent_rows" in md_dict: | ||
dict_["x-retire-absent-rows"] = md_dict["retire_absent_rows"] | ||
if "natural_key" in md_dict: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also no longer needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also removed
... | ||
... | ||
``` | ||
Using this setting, records are not retired in the destination if their corresponding natural keys are not present in the source extract. This allows for incremental extracts that only contain updated records. You need to specify the natural key as `merge_key` when `retire_absent_rows` is `False`. Compound natural keys are allowed and can be specified by providing a list of column names as `merge_key`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rudolfix we are using only the merge_key as opposed to a combination of primary key and merge key here, let us know if this is ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorritsandbrink @sh-rp I think there's some kind of conceptual confusion. merge_key
is correct but it is not a natural key for the source data. it is the same merge_key
we have in delete-insert
strategy.
Example 1: I load data day by day. I set the merge key to day
column. I load the same day twice. Then I retire records only from the given day.
Example 2: I set the merge key to the same value as primary (natural) key (or to the content hash if no natural key is present). Then I have the behavior as I'd set the retire_absent_rows
to False. (which is not needed anymore btw.)
Example 3: I update chunked documents and use merge key to be doc_id. Then I retire all the missing chunks for those doc ids, not touching the others.
My take:
- drop the flag
- if the merge key is present, always limit the set of records to retire
- update the documentation to explain two basic cases (do not retire deleted records at all, retire deleted records only for given partition)
WDYT? IMO this is really powerful functionality if done this way
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you're right :) I'll adapt per your suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small changes requested and one question that @rudolfix should answer.
scd2
with retire_absent_rows
flagscd2
with merge_key
@rudolfix I've adapted the code per your suggestions. Can you review? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is so good! thanks!
Description
Adds
merge_key
support forscd2
to enable incremental extracts.When a
merge_key
is provided, absent records are only retired if theirmerge_key
value is present in the source extract.Related Issues
Closes #1789